Goto

Collaborating Authors

 function class


Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function

Neural Information Processing Systems

Modern machine learning algorithms, especially deep learning-based techniques, typically involve careful hyperparameter tuning to achieve the best performance. Despite the surge of intense interest in practical techniques like Bayesian optimization and random search-based approaches to automating this laborious and compute-intensive task, the fundamental learning-theoretic complexity of tuning hyperparameters for deep neural networks is poorly understood. Inspired by this glaring gap, we initiate the formal study of hyperparameter tuning complexity in deep learning through a recently introduced data-driven setting. We assume that we have a series of learning tasks, and we have to tune hyperparameters to do well on average over the distribution of tasks. A major difficulty is that the utility function as a function of the hyperparameter is very volatile and furthermore, it is given implicitly by an optimization problem over the model parameters. To tackle this challenge, we introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance as we vary the hyperparameter; our analysis relies on subtle concepts including tools from algebraic geometry, differential geometry and constrained optimization. We use this to show that the learning-theoretic complexity of the corresponding family of utility functions is bounded. We instantiate our results and provide sample complexity bounds for concrete applications--tuning a hyperparameter that interpolates neural activation functions and setting the kernel parameter in graph neural networks.


Uniform Wrappers: Bridging Concave to Quadratizable Functions in Online Optimization

Neural Information Processing Systems

This paper presents novel contributions to the field of online optimization, particularly focusing on the adaptation of algorithms from concave optimization to more challenging classes of functions. Key contributions include the introduction of uniform wrappers, a class of meta-algorithms that could be used for algorithmic conversions such as converting algorithms for convex optimization into those for quadratizable optimization. Moreover, we propose a guideline that, given a base algorithm Afor concave optimization and a uniform wrapper W, describes how to convert a proof of the regret bound of A in the concave setting into a proof of the regret bound of W(A)for quadratizable setting. Through this framework, the paper demonstrates improved regret guarantees for various classes of DR-submodular functions under zeroth-order feedback. Furthermore, the paper extends zeroth-order online algorithms to bandit feedback and offline counterparts, achieving notable improvements in regret/sample complexity compared to existing approaches.


Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon

Neural Information Processing Systems

We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs--a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. On two natural settings (1) generalization gap for flat solutions, and (2) mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between the flat solutions compared to low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of "neural shattering" where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.


Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

Neural Information Processing Systems

We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $ O (\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|)} $ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $ \mathcal{F} $ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KTR_T(\mathcal{O})} + \sqrt{ d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $R_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle.


Conditional Predictive Inference for General Structured Data with Group Symmetries

arXiv.org Machine Learning

We study distribution-free predictive inference for data with group symmetries, aiming to establish near-conditional coverage guarantees beyond exchangeability for structured data. While many predictive inference methods achieve a target coverage level, most provide marginal coverage. In practice, conditional predictive inference is often preferred, as it quantifies uncertainty for black-box predictions given observed attributes, thereby accommodating heterogeneity. Although many efforts have pursued efficient conditional coverage, existing methods rely on the i.i.d. or exchangeable assumption, often violated in structured settings such as networks, clusters, and imaging data. Recently, SymmPI introduced a unified approach to predictive inference under group symmetries beyond exchangeability; nevertheless, its guarantees remain marginal and do not account for population heterogeneity. To bridge this gap, we introduce C-SymmPI, a framework that achieves near-conditional coverage under general data structures with group symmetries, extending beyond exchangeability to cover networks, cluster-level data, and related structures. Inspired by relaxed multi-accuracy, our approach reformulates conditional coverage as miscoverage error over a user-specified function class. We establish theoretical guarantees under distributional invariance and distribution shift, and derive convergence rates for linear and RKHS function classes, recovering state-of-the-art results in the exchangeable setting as special cases. For computational efficiency, we develop two variants: a projection-based algorithm for high-dimensional observations, and a sampling-based algorithm for large or infinite groups. We demonstrate effectiveness on hierarchical and network data. Empirical results show that C-SymmPI delivers more informative and stable conditional coverage with improved accuracy compared to existing methods.


New Bounds for Hyperparameter Tuning of Regression Problems Across Instances

Neural Information Processing Systems

The task of tuning regularization coefficients in regularized regression models with provable guarantees across problem instances still poses a significant challenge in the literature. This paper investigates the sample complexity of tuning regularization parameters in linear and logistic regressions under ℓ1 and ℓ2-constraints in the data-driven setting. For the linear regression problem, by more carefully exploiting the structure of the dual function class, we provide a new upper bound for the pseudo-dimension of the validation loss function class, which significantly improves the best-known results on the problem. Remarkably, we also instantiate the first matching lower bound, proving our results are tight. For tuning the regularization parameters of logistic regression, we introduce a new approach to studying the learning guarantee via an approximation of the validation loss function class. We examine the pseudo-dimension of the approximation class and construct a uniform error bound between the validation loss function class and its approximation, which allows us to instantiate the first learning guarantee for the problem of tuning logistic regression regularization coefficients.


Selective Sampling and Imitation Learning via Online Regression

Neural Information Processing Systems

We consider the problem of Imitation Learning (IL) by actively querying noisy expert for feedback. While imitation learning has been empirically successful, much of prior work assumes access to noiseless expert feedback which is not practical in many applications. In fact, when one only has access to noisy expert feedback, algorithms that rely on purely offline data (non-interactive IL) can be shown to need a prohibitively large number of samples to be successful. In contrast, in this work, we provide an interactive algorithm for IL that uses selective sampling to actively query the noisy expert for feedback. Our contributions are twofold: First, we provide a new selective sampling algorithm that works with general function classes and multiple actions, and obtains the best-known bounds for the regret and the number of queries.



related work

Neural Information Processing Systems

Deterministic RLDeterministic system is often the starting case in the study of sample-efficient algorithms, where the issue of exploration and exploitation trade-off is more clearly revealed since both the transition kernel and reward function are deterministic. The seminal work [81] proposes a sample-efficient algorithm for Q-learning that works for a family of function classes. Recently, [21] studies the agnostic setting where the optimal Q-function can only be approximated by a function class with approximation error. The algorithm in [21] learns the optimal policy with the number of trajectories linear with the eluder dimension. Consider MDPM where the transition is deterministic. Assume the function class in Definition 3.1 satisfies Assumption 2.1 and Assumption 2.2. For any t (0,1), if d Ω(log(BW/λ))and n d poly(κ,k,λ,BW,Bϕ,H,log(d/t)), then with probability at least 1 tAlgorithm 1 returns the optimal policy π .